INTRODUCTION


It  is  generally  agreed  that  regular  screening  by  mammography  is  a  woman's  best  strategy  for 
preventing  death  due  to  breast  cancer.  However,  mammography  quality  is  of  concern  for 
three  reasons.  First,  recent  evidence  of  variability  in  radiologists’  interpretations  of  the  same 
mammograms  suggests  that  improvement  is  needed  in  mammographers’  accuracy  in  reading 
films.1  Second,  growing  attention  to  issues  of  costs  and  cost-effectiveness  suggests  the 
importance  of  improving  specificity  in  reading  mammograms.2,3  Third,  the  efficacy  of 
screening  younger  women  remains  controversial. 

The  primary  objective  of  this  project  is  to  develop  a  comprehensive  mammography  quality 
improvement  program  (MQIP)  that  can  be  easily  disseminated  to  practicing  radiologists 
located  in  rural  areas.  The  project  focuses  on  rural  areas  because  these  communities  have 
been identified as being underserved by public health research.8,9,10 Additionally, there may
be  cause  for  concern  about  the  quality  of  care  offered  in  rural  areas.11 

The  Fred  Hutchinson  Cancer  Research  Center  (FHCRC),  the  Department  of  Radiology  at  the 
University  of  Washington  (UW),  and  the  Washington  State  Cancer  Registry  (WSCR)  at  the 
Department  of  Health  (DOH)  are  collaborating  to  develop  and  implement  the  MQIP  to 
demonstrate  its  feasibility  and  effectiveness  for  dissemination.  The  MQIP  emphasizes 
improvement  in  film  interpretation,  within  the  context  of  a  comprehensive  program  designed 
to meet the requirements of the Mammography Quality Standards Act (MQSA) of 1992.

The  MQIP  is  a  demonstration  project  and  consists  of  four  basic  functions.  It  employs  routine 
systematic  monitoring  of  measurable  outcomes  of  screening  mammography,  including 
sensitivity,  specificity,  and  positive  predictive  value.  This  is  referred  to  as  its  surveillance 
function.  It  also  identifies  for  mammographers  their  false  positive  and  false  negative  cases,  so 
that  they  can  improve  quality  through  review  of  their  own  films.  This  is  its  audit  function.  In 
addition,  it  provides  continuing  education  for  radiologists,  and  training  for  technologists,  as 
required  by  MQSA  as  well  as  training  for  registrars.  This  is  its  certification  function.  Most 
importantly,  it  incorporates  immediate  feedback  following  a  radiologist’s  interpretation  of 
practice  films  selected  for  their  educational  value.  This  is  its  continuous  quality  improvement 
(CQI)  function.  The  MQIP  is  comprehensive,  and  will  ensure  that  participating  facilities  are  in 
compliance  with  evolving  accreditation  rules. 
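
The three surveillance measures named above can be illustrated with a minimal sketch. The function name and the counts below are hypothetical and are not taken from MTR data; the sketch only shows how sensitivity, specificity, and positive predictive value follow from counts of true and false positives and negatives.

```python
def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Summarize screening outcomes from true/false positive and negative counts."""
    def ratio(num: int, den: int) -> float:
        # Return NaN rather than raising when a denominator is empty.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # cancers that were called positive
        "specificity": ratio(tn, tn + fp),  # non-cancers that were called negative
        "ppv": ratio(tp, tp + fp),          # positive calls that were cancer
    }


# Hypothetical counts for one radiologist (illustration only).
metrics = screening_metrics(tp=8, fp=90, fn=2, tn=900)
print(metrics)
```

With these illustrative counts, sensitivity is 8/10, specificity is 900/990, and positive predictive value is 8/98, which shows how a reader can have high sensitivity and specificity while most positive calls are still false positives.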

The  MQIP  builds  on  another  project  funded  through  the  National  Cancer  Institute  that  is  being 
conducted  at  the  FHCRC  entitled  the  Washington  Mammography  Tumor  Registry  (MTR) 
(Nicole  Urban,  P.I.).  The  MTR  is  a  registry  of  mammography  data  obtained  from  facilities  in 
Washington  State,  which  is  linked  to  tumor  data  obtained  from  the  WSCR  and  the  Puget 
Sound  Cancer  Surveillance  System.  The  purpose  of  this  registry  is  to  provide  a  resource  for 
research  into  mammography  performance  and  breast  cancer  in  addition  to  offering 
informational  reports  to  participating  radiologists  and  facilities.  The  MTR  will  be  used  to 
accomplish  the  surveillance  and  audit  functions  of  the  MQIP. 
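
As an illustration of how linked mammography and tumor data could support the audit function, the sketch below labels each screen as a true positive, false positive, false negative, or true negative. It is a sketch under stated assumptions, not the MTR's actual procedure: the one-year follow-up window, the record fields, and the function names are all hypothetical.

```python
from datetime import date, timedelta

# Assumed follow-up window for counting a cancer against a screen (not from the report).
FOLLOW_UP = timedelta(days=365)


def classify_screen(screen_date, positive, diagnosis_dates):
    """Label one screen given the woman's registry diagnosis dates."""
    cancer = any(screen_date <= d <= screen_date + FOLLOW_UP for d in diagnosis_dates)
    if positive:
        return "TP" if cancer else "FP"
    return "FN" if cancer else "TN"


def audit(screens, registry):
    """Group screen identifiers by outcome; registry maps patient_id to diagnosis dates."""
    cases = {"TP": [], "FP": [], "FN": [], "TN": []}
    for s in screens:
        diagnoses = registry.get(s["patient_id"], [])
        cases[classify_screen(s["date"], s["positive"], diagnoses)].append(s["screen_id"])
    return cases


# Hypothetical records (illustration only).
screens = [
    {"screen_id": 1, "patient_id": "A", "date": date(1998, 1, 10), "positive": True},
    {"screen_id": 2, "patient_id": "B", "date": date(1998, 2, 1), "positive": True},
    {"screen_id": 3, "patient_id": "C", "date": date(1998, 2, 15), "positive": False},
    {"screen_id": 4, "patient_id": "D", "date": date(1998, 3, 1), "positive": False},
]
registry = {"A": [date(1998, 3, 1)], "C": [date(1998, 9, 1)]}
result = audit(screens, registry)
print(result)
```

The false positive and false negative lists are the cases a radiologist would be asked to review; the counts in each group feed the surveillance measures.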

A  research  study  is  being  conducted  within  the  MQIP  demonstration  project.  The  primary 
research objective is to determine if the CQI program can increase the accuracy with which
mammographers interpret films. Secondary research objectives are to 1) determine inter-rater
variability in film interpretation in a set of films selected for their teaching value, before
and  after  implementation  of  the  CQI  program;  2)  determine  post-CQI  intra-rater  variability  in 
film  interpretation;  3)  determine  if  digitized  films  can  be  interpreted  with  the  same  accuracy 
as  can  high-quality  copies  of  films;  and  4)  determine  if  the  accuracy  with  which  films  are 
interpreted  depends  on  covariates,  the  age  of  the  woman  being  of  particular  interest.  The 
availability  of  comparison  films  will  also  be  considered  as  a  covariate. 

This  three-year  project  is  currently  at  the  end  of  its  second  year. 

BODY 

Eighteen  major  tasks  were  identified  in  the  original  Statement  of  Work  as  being  imperative  to 
the  successful  completion  of  this  project.  These  tasks  are  listed  in  a  table  included  in 
Appendix  A.  Also  included  is  a  timeline  detailing  project  progress  during  Year  02  and  plans 
for  Year  03. 

Progress in the CQI Function

During the past year, the primary focus of project work has been
on  the  CQI  function  of  the  MQIP.  An  article  describing  the  design  of  the  study  has  been 
published  and  is  included  as  Appendix  B.  This  research  study  is  composed  of  5 
mammography-reading  sessions.  During  each  session,  a  participating  radiologist  will  read  a 
mammographic  film  and  provide  an  assessment.  The  radiologist  will  mark  his  or  her 
assessments  in  the  CQI  software  developed  specifically  for  this  project  and  will  receive 
feedback  from  the  program.  If  the  radiologist  identifies  a  malignancy,  s/he  must  indicate  on 
the  digitized  image  on  the  computer  screen  where  s/he  believes  the  malignancy  is  located. 

The first session provides the “baseline” score for the physician, and the fourth session
provides the follow-up score. Sessions two and three are teaching sessions designed to
improve  the  radiologist’s  accuracy  in  reading  mammograms.  The  fifth  and  final  session 
varies  from  the  first  four  in  that  the  radiologist  will  only  be  allowed  to  read  the  digitized 
image  on  the  computer  as  opposed  to  having  films  available.  The  purpose  of  this  session  is  to 
assess  the  feasibility  of  disseminating  the  CQI  over  the  Internet.  Participating  radiologists 
will  receive  two  Continuing  Medical  Education  (CME)  credits  per  session  for  a  total  of  10 
credits. 

Project  radiologists  and  field  coordinators  have  spent  a  substantial  amount  of  time  this  past 
year  developing  and  implementing  methods  to  recruit  radiologists  and  mammography 
facilities.  As  described  in  the  manuscript  in  Appendix  B,  the  project  would  need  a  minimum 
of  30  radiologists  to  have  sufficient  power  to  detect  a  10%  change  in  sensitivity  and 
specificity from the baseline to the follow-up scores. To date, 94 radiologists have been
approached and 37 have signed on to receive the intervention. The additional seven
radiologists  are  considered  a  safeguard  in  the  event  that  a  radiologist  drops  out  of  the  study 
prior  to  completing  all  sessions. 

Project  radiologists  also  spent  a  substantial  amount  of  time  locating  the  mammographic 
studies that would compose the 5 sessions. The criterion for film selection is that each film
be sufficiently difficult to read that the overall average specificity and sensitivity for each
session developed from the films would be about 70%. Locating 180 films that meet this
criterion has been particularly challenging. Project staff were able to identify locally a
sufficient number of films to compose the test sessions; however, another source had to be
located to provide films for the two teaching sessions and the one digitized session. After
some  research,  a  large  mammography  reading  and  teaching  center  located  in  Rochester,  New 
York  was  contacted  and  has  agreed  to  provide  the  study  with  enough  cases  to  compose  the 
remaining  sessions.  These  cases  will  be  added  to  the  software  in  the  early  part  of  Year  03. 

To assure the success of the CQI, a pretest and a pilot were conducted in the last half of Year 02.
Five radiologists participated in the pretest, in which they were asked to review 90
mammographic studies, and four radiologists participated in the pilot and reviewed a set of 45
studies. Each participant was shown a mammogram and saw a digitized copy of
that mammogram on a laptop PC. Using the PC, the radiologist recorded his or her
assessment of the mammogram. If the radiologist saw a possible malignancy in the
mammogram, he or she then clicked on the corresponding area of the digitized image.
This location was then recorded. At the conclusion of the
session,  each  radiologist  in  the  pilot  then  reviewed  each  case  and  the  accompanying 
educational  text  and  provided  feedback  to  the  project  about  the  quality  of  the  case  as  well  as 
the  description  of  it. 

Results  from  all  participants  of  both  the  pretest  and  pilot  were  then  combined  and  reviewed  by 
project  investigators.  The  average  sensitivity  of  the  pretest  and  pilot  combined  was  67.8%  and 
the  average  specificity  was  77.1%.  These  results  assisted  the  project  in  assuring  that  the 
overall  baseline  sensitivity  and  specificity  met  the  study  requirement  of  being  in  the  range  of 
70%. 

Progress in the Surveillance and Audit Functions

The MTR is being used to address these two
functions.  Adding  facilities  to  the  MTR  is  a  very  laborious  task  involving  a  great  deal  of 
interaction  between  the  MTR  and  facility  staff,  as  is  demonstrated  in  the  flow  chart  included 
in Appendix C. The overall process of adding a single facility to the MTR can take many
months, depending on the type of system in which the facility maintains its data and the overall
quality of the data.

Facility  recruitment  to  join  the  MQIP  has  been  ongoing  throughout  Year  02.  Twenty-seven 
facilities  providing  mammography  services  to  rural  Washington  were  originally  identified  and 
contacted. Of these 27, 8 facilities have to date signed agreements to provide mammography
data  to  the  MTR  and  three  have  refused  participation.  The  remaining  16  facilities  are  in  the 
process  of  deciding  whether  or  not  to  participate. 

Several  of  the  8  participating  mammography  facilities  have  already  provided  the  MTR  with 
their  initial  download  of  data.  Project  programmers  are  working  with  these  facilities  to 
validate  and  clean  their  data  before  the  final  link  to  the  cancer  registry  data.  Once  this  is 
complete,  surveillance  and  audit  reports  will  be  generated.  The  participating  facilities  are 
expected  to  receive  their  initial  reports,  including  audit  reports  specifically  for  radiologists, 
during  the  first  half  of  Year  03. 

Two  of  the  8  facilities,  which  have  been  unable  to  provide  us  with  electronic  data,  have 
participated in our data collection by providing us with their data via mammography forms
(see Appendix D). These forms, which we receive monthly from the facilities, have been
created  specifically  for  the  purpose  of  collecting  data  from  facilities  that  are  interested  in  our 
study  and  desire  our  feedback  but  are  unable  to  provide  us  with  their  data  electronically. 

Project  staff  are  working  with  the  remaining  facilities  who  do  have  retrospective  electronic 
data  to  obtain  initial  downloads.  Two  facilities  are  working  with  their  software  vendor  for  the 
purpose  of  creating  extraction  programs  that  will  simplify  this  process  for  clinic  staff. 

At  the  conclusion  of  the  MQIP,  it  is  anticipated  that  all  facilities  recruited  for  the  surveillance 
function  will  remain  as  members  of  the  MTR. 

Progress in the Certification Function

The first training conference for mammography
technologists was developed and presented during Year 02 of the project. Attending
technologists received up to eight Category A credits from the American Society of
Radiologic Technologists (ASRT). Of the 140 technologists working in facilities that were
solicited  for  participation  in  the  MQIP,  53  attended  the  conference. 

The  second  training  conference  took  place  in  October  1998.  It  contained  many  sessions 
similar  to  the  first  conference,  but  added  a  few  new  topics  based  on  feedback  from  the  original 
conference.  To  be  certain  that  all  eligible  technologists  had  a  reasonable  opportunity  to  attend 
at  least  one  of  the  conferences,  the  second  conference  was  held  in  Eastern  Washington.  This 
conference  was  attended  by  59  technologists  representing  14  facilities,  and  was  very  well 
received. The table below summarizes the overall response to each session based on
evaluation forms completed by participants.


Evaluation  of  Technologist  Training  Sessions  (Overall,  how  satisfied  were  you?) 


Session | % Very Satisfied (Session 1 / Session 2) | % Satisfied (Session 1 / Session 2) | % Other (Session 1 / Session 2)
Anatomy and Pathology Lecture | 82% / 90% | 16% / 10% | 2%
Problem Solving and Practical Application Lecture | 77% / 67% | 17% / 33% | 6%
Pattern Recognition and Pathological Changes Workshop | 37% | 41% | 22%
Critical Analysis Lecture | 82% / 88% | 18% / 12% |
Positioning Workshop | 86% / 86% | 10% / 14% | 4%
Problem Solving Workshop | 75% / 35% | 23% / 10% | 2%
Nuclear Medicine Lecture | 38% | 56% | 6%
Pathological Changes | 81% | 19% |
Delayed Diagnosis of Breast Malignancies | 82% | |
Applications to Breast Ultrasound Lecture | 68% | 32% |

In addition, the MQIP, in coordination with the WSCR, sponsored three Washington State
registrars’ training conferences during Year 02. The MQIP was responsible for describing
the importance of quickly and accurately documenting breast cancer cases and the importance
of using the TNM staging system to accurately stage tumors. Additionally, the MQIP
provided registrars with an example of how their work is put to use for research
purposes. These conferences were also well received. The last training conference took place
in June 1998. Beginning in February 1999, the project will evaluate the impact of the
training for registrars by reviewing the quality of data being entered into the registry
(particularly how many cases have TNM staging associated with them) and how quickly
cases are entered into the registry once diagnosed.

During Year 02, MQIP staff explored the possibility of having the MQIP certified by
the  State  of  Washington.  We  had  originally  proposed  to  obtain  this  certification  to  assure 
MQIP  participants  that  any  data  collected  for  the  program  would  be  confidential,  used  only 
for  purposes  of  quality  improvement,  and  protected  from  subpoena.  However,  the  state 
certification  has  been  determined  to  be  appropriate  for  single  institution  programs  only.  As 
the  MQIP  is  working  with  multiple  facilities,  it  is  not  possible  to  meet  certain  requirements 
such  as  including  regular  meetings  of  participants  to  discuss  the  care  that  they  provide.  Instead 
of  state  certification,  the  project  has  obtained  a  similar  federal  protection  that  is  more  specific 
to research activities through a federal Certificate of Confidentiality. This certificate protects
from subpoena data that contain sensitive information, including patient-identifying
information and provider information.

CONCLUSIONS 

The  past  year  has  been  very  productive  for  this  project.  Considerable  progress  was  made  in 
all functions of the MQIP. As part of the CQI function, recruitment challenges were
met, mammography films that compose the test sessions were obtained and prepared, and the
sessions  were  piloted.  The  CQI  is  on  track  to  be  completed  and  evaluated  by  the  end  of  Year 
03.  Eight  mammography  facilities  were  also  recruited  to  the  project  and  are  in  various  stages 
of  transferring  data  and  receiving  reports  that  compose  both  the  surveillance  and  audit 
functions of the MQIP. Multiple training sessions with mammography technologists and
registrars  were  conducted  as  part  of  the  certification  function  of  the  MQIP.  The  project  also 
explored  the  possibility  of  getting  certification  for  the  MQIP  from  Washington  State,  but 
obtained  the  federal  Certificate  of  Confidentiality  instead. 

Because  the  project  is  still  in  the  data  collection  phase,  there  are  no  results  to  report.  As  is 
demonstrated  in  the  timeline  included  in  Appendix  A,  the  majority  of  evaluation  activities 
will  be  conducted  during  Year  03. 
APPENDIX A


Summary  of  Major  Tasks 
Associated  with  Project  and 
Detailed  Timeline  of  Project  Years  02  and  03 


Major  tasks  listed  in  original  Statement  of  Work 


Function associated with task | Major task | Progress
All | 1. Recruit and enroll radiologists and mammography facilities to MQIP | Radiologist recruitment completed Year 02; facility recruitment ongoing
CQI | 2. Obtain CME credit for CQI | Complete, Year 01
CQI | 3. Obtain 180 mammograms for 5 sessions of CQI | Accumulated films for test sessions (1 and 4) during Years 01 and 02. Will complete accumulation for training sessions (2, 3, and 5) in first part of Year 03.
CQI | 4. Develop software for CQI | Complete, Year 01. Debugging occurred during pretest and pilot in Year 02.
CQI | 5. Pilot CQI | Conducted pretest and pilot, Year 02
CQI | 6. Implement CQI | Scheduled for initiation and completion, Year 03
Surveillance | 7. Develop materials to allow facilities without computerized systems to participate in MTR | Complete, Year 01
 | 8. Obtain certification for training technologists | Complete, Year 01
Certification | 9. Conduct training workshops for technologists | Two training workshops conducted during Year 02.
All | 10. Apply for certification of MQIP by Washington State | A federal Certificate of Confidentiality was obtained, Year 02.
All | 11. Implement MQIP | Initiated Year 01, will be complete Year 03
Surveillance/Audit | 12. Link mammography data to tumor registry via MTR | Initiated Year 02. Ongoing through Year 03.
Audit | 13. Provide feedback reports to participants | Scheduled Year 03.
CQI | 14. Evaluate impact of CQI on accuracy of interpretation in communities | After implementation of CQI. Will be done in latter half of Year 03
CQI | 15. Evaluate inter-/intra-observer variability | After implementation of CQI. Will be done in latter half of Year 03
CQI | 16. Evaluate adequacy of digitized films | After implementation of session 05. Will be done in latter half of Year 03
Certification | 17. Evaluate impact of training CTRs on % of cancer cases entered in tumor registry and quality of data | Last CTR training held in Year 02. This will be done in beginning half of Year 03

Certification 
are read. A simple randomized design is suggested in which a relatively large group of readers read sets of m

in the context of such studies: (i) the choice of primary outcome measure; (ii) the data analytic techniques to
be employed; and (iii) the methodology for calculating sample sizes for readers and images.

First, we argue in favor of using sensitivity and specificity as the primary outcome measures rather than receiver
operating characteristic (ROC) curves in mammography studies, although the latter are considered state of the

these measures and allows for estimation of joint effects on them. Finally, we
complexity into power calculations. The simulation method that we propose accommodates such
efficacy of an educational intervention. In the context of this study we illustrate the steps involved in
calculations and apply the data analytic techniques to the sort of data expected to result from this study. Though
the proposed methods were motivated by this particular study, the statistical considerations

broadly in mammography and indeed in other types of radiologic imaging studies. Standards for the conduct
of radiologic reading studies are not yet well developed, as they are for randomized clinical trials and for
case-control studies. We hope that the discussion in this paper will add to the dialogue necessary for development
of such standards. J Clin Epidemiol 50;12:1327-1338, 1997. © 1997 Elsevier Science Inc.
1. INTRODUCTION

Mammography screening for breast cancer has been shown to be associated with decreased
breast cancer mortality, at least in women over the age of 50 years [1]. Major efforts are
currently underway to improve participation by women in screening programs [2].
Nevertheless, there is concern about the quality of mammography screening and there is
general agreement that improvements in quality may lead to improvements in the
performance of mammography as a screening modality. Quality might be improved, for
example, by improving the imaging procedures. Alternatively, improvements in the accuracy
with which mammographers interpret mammograms may improve the performance of
screening mammography. Recent studies [3,4] have shown that there is considerable
variability amongst radiologists in their interpretations of screening mammograms. Elmore et
al. [3] observed that sensitivities ranged from 74% to 96% and that specificities ranged from
35% to 89% among 10 radiologists reading 150 selected mammograms. Beam et al. [4], using
a much larger sample of 108 radiologists, each reading 79 mammograms, found sensitivities
in the range of 47-100% and specificities in the range of 35-99%. These observations suggest
that improvement in interpretation may be possible.

Cancer Research Center, Program in Biostatistics, 1124 Columbia Street,
MP^5, Seattle, Washington 98104.

Accepted for publication on 20 August 1997.

As part of a project called the Mammography Quality Improvement Project (MQIP), funded by
the Department of Defense and aimed at improving the quality of mammography screening in
rural communities, we are developing an educational program to improve the accuracy with
which radiologists interpret mammograms. The educational intervention is composed of a
series of five sessions in which mammographers read films and are provided with immediate
feedback on the accuracy of their interpretations. Feedback is provided using a laptop
personal computer that is mailed to the radiologist prior to his or her reading session. The
computer program emphasizes the particular features of each mammogram that are relevant
to determining the disease status of the woman screened. Eventually it may be possible to
disseminate this sort of intervention over computer networks, thus making it attractive in
terms of easy accessibility and low cost.

To evaluate the impact of such an intervention on improvements in diagnostic accuracy it will
eventually be necessary to perform a study of radiologists’ interpretations of screening
mammograms in their actual practices. As a preliminary step to such a large-scale study, we
will evaluate the intervention effects in a more controlled setting. Specifically, we will have a
number of radiologists read a selected set of mammograms before and after the intervention
and evaluate changes in accuracy. About 50% of the mammograms included in this
controlled study will be from women with disease, a proportion much larger than would be
observed in practice but necessarily high to estimate sensitivity rates in a small-scale study.
Mammograms will be selected to represent a reasonably broad range of interpretive
difficulty.

The purpose of this paper is to elucidate some of the key statistical issues in the design of
such a controlled reading study. Standards for the design of such studies are not well
developed. This contrasts with therapeutic clinical trials and epidemiologic studies where the
basic elements of study design are now fairly well standardized [5]. The question we propose
to address in this reading study, namely evaluation of an intervention effect in a controlled
setting, is a standard sort of question addressed in diagnostic imaging research. Hence the
design issues which are dealt with here will have implications for future studies in
mammography and in other diagnostic test settings. These same issues also arise in reading
studies designed to compare different imaging modalities. The key issues concern the choice
of relevant primary outcome measures, appropriate data analysis strategies, and methodology
for power calculations that incorporates variability among radiologists and among images.
Broader issues regarding study designs for evaluating imaging tests have been discussed in a
more general sense in the literature [6,7].

In Section 2, we consider two sets of measures that can be used to define accuracy in reading
mammograms: first, sensitivity and specificity and second, ROC curves. We argue in favor of
the former, in part, because they are more clinically relevant and most easily understood, but
also because the latter can provide inappropriate conclusions concerning intervention
benefits. In Section 3, we detail the basic elements of the statistical design of our study that
could be considered a prototype for evaluating intervention effects in diagnostic radiology.
An approach to joint analysis of sensitivity and specificity is outlined in Section 4. In
Section 5, we describe methodology for power calculations that are appropriate for the
proposed design and analysis. We propose the use of computer simulation methods for
calculating power because they allow for complex designs and can easily incorporate
variability amongst radiologists and images. Having described the steps involved in
calculating power in Section 5, we then apply these procedures to the proposed MQIP study
in Section 6, in order to illustrate the methods. Concluding remarks follow in Section 7.
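
To make the idea of simulation-based power calculation concrete, the following is a minimal sketch and not the method developed in Sections 5 and 6. It assumes a paired pre/post design, reader-to-reader variability in baseline sensitivity, no film-level variability, and a normal-approximation test on the mean per-reader change. All parameter values (30 readers, 20 diseased films per session, baseline sensitivity 0.70, reader standard deviation 0.08, effect 0.10) and function names are assumptions chosen for illustration.

```python
import random
from statistics import NormalDist, mean, stdev


def read_films(rng, p, n_films):
    """Number of correct calls among n_films, each correct with probability p."""
    return sum(rng.random() < p for _ in range(n_films))


def simulate_power(n_readers=30, n_films=20, base=0.70, reader_sd=0.08,
                   effect=0.10, alpha=0.05, n_sims=500, seed=1):
    """Estimate the probability of detecting a pre/post change in sensitivity."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        diffs = []
        for _ in range(n_readers):
            # Each reader has his or her own baseline sensitivity (assumed normal, clipped).
            p_pre = min(0.95, max(0.05, rng.gauss(base, reader_sd)))
            p_post = min(0.99, p_pre + effect)
            pre = read_films(rng, p_pre, n_films) / n_films
            post = read_films(rng, p_post, n_films) / n_films
            diffs.append(post - pre)
        # Normal-approximation test on the mean per-reader change.
        se = stdev(diffs) / n_readers ** 0.5
        if se > 0 and abs(mean(diffs) / se) > z_crit:
            rejections += 1
    return rejections / n_sims


power = simulate_power()
print(f"Estimated power: {power:.2f}")
```

Calling `simulate_power(effect=0.0)` gives the rejection rate when there is no intervention effect, which is a useful check that the test is roughly calibrated. A fuller simulation would also draw film-level difficulty, since the text stresses variability among images as well as among radiologists.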

2.  MEASURES  OF  ACCURACY 
2.1  Definitions 

A radiologist reading a set of mammograms for a woman in our study will classify each breast
according to his or her suspicion of its showing malignancy. The ACR lexicon for rating a
breast [8], which we will employ, defines a 5-point scale with category 1 indicating “normal,
routine follow-up recommended,” 2 indicating “benign, routine follow-up,” 3 indicating
“probably benign, early recall recommended,” 4 indicating “suspicious for cancer, consider
biopsy,” and 5 indicating “highly suspicious for cancer, biopsy recommended.” A common
definition of a screen positive mammogram is one that receives a rating of 4 or greater. These
are mammograms that are sufficiently suspicious for cancer that biopsy is recommended and
hence they have an impact on clinical practice. Sometimes a rating of 3 or greater is
considered positive. Because of the clinical implications of ratings 4 and 5, we will focus on
the positivity criterion of category ≥4 here.

Given a definition for screen positivity, since there is a rating for each breast, one can
calculate sensitivities and specificities with either “woman” or “breast” as the unit of
analysis. The latter includes all non-diseased breasts (including non-diseased breasts from
women with cancer) as the denominator for specificity and all diseased breasts as the
denominator for sensitivity. However, since the consequences of false positive and false
negative errors relate to the woman (rather than the breast), it seems more clinically relevant
to use woman rather than breast as the unit of analysis. Thus, for example, we count the
proportion of women with disease who have it detected as the sensitivity, rather than defining
the sensitivity to be the proportion of diseased breasts which are detected. This accords with
previous literature [3]. One could use the maximum of the ratings for the left and right sides
as the woman level rating for calculation of sensitivity and specificity. Occasionally,
however, a woman with unilateral disease may not have it detected in the affected side but
will have a positive mammogram on the unaffected side. In this case, using the maximum
rating will inappropriately inflate the sensitivity. We define sensitivity instead as the
proportion of women with disease who have it detected (a rating of ≥4) on the affected side.
The specificity is the proportion of women without disease who have a maximum rating of
less than 4.
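
The woman-level definitions above can be written out as a short sketch. The record layout (a rating per side plus the set of affected sides) and the seven example women are hypothetical; the sketch only illustrates how scoring on the affected side differs from scoring on the maximum rating.

```python
POSITIVE = 4  # rating at or above which a mammogram is called screen positive


def sensitivity(women, use_max=False):
    """Proportion of women with disease who are detected.

    By default a woman counts as detected only if an affected side is rated
    positive. With use_max=True the maximum of both sides is used instead.
    """
    diseased = [w for w in women if w["affected"]]
    detected = 0
    for w in diseased:
        if use_max:
            detected += max(w["left"], w["right"]) >= POSITIVE
        else:
            detected += any(w[side] >= POSITIVE for side in w["affected"])
    return detected / len(diseased)


def specificity(women):
    """Proportion of women without disease whose maximum rating is below 4."""
    healthy = [w for w in women if not w["affected"]]
    return sum(max(w["left"], w["right"]) < POSITIVE for w in healthy) / len(healthy)


# Hypothetical ratings (illustration only).
women = [
    {"left": 5, "right": 1, "affected": {"left"}},   # detected on affected side
    {"left": 4, "right": 2, "affected": {"right"}},  # positive only on unaffected side
    {"left": 2, "right": 1, "affected": {"left"}},   # missed
    {"left": 1, "right": 4, "affected": {"right"}},  # detected on affected side
    {"left": 1, "right": 1, "affected": set()},
    {"left": 3, "right": 2, "affected": set()},
    {"left": 4, "right": 1, "affected": set()},      # false positive
]
print(sensitivity(women), sensitivity(women, use_max=True), specificity(women))
```

In this example the second woman is positive only on her unaffected side, so the maximum-rating rule credits a detection that the affected-side rule does not: sensitivity is 0.50 under the affected-side definition and 0.75 under the maximum-rating rule.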
ROC analysis is a statistical technique used to describe
accuracy of diagnostic tests when the test outcome is either
ordinal or continuous as opposed to binary. The rating data
generated in radiology reading studies are ordinal, and ROC
analysis is often considered optimal for the analysis of such
studies, as is evidenced, for example, in a recent issue of
Academic Radiology [9]. An ROC curve is constructed by
varying the criterion used for defining a positive mammogram
from “rating ≥2” to “rating ≥5,” plotting the associated
sensitivity and 1-specificity values against each other,
and finally fitting a curve to the points so that the curve is
anchored at (0,0) and (1,1). Various algorithms exist for
fitting a curve, the most notable being the Dorfman-Alf algorithm
based on the binormal model [10] and the empirical
nonparametric method that simply connects observed ROC
points linearly. The area under the ROC curve is usually
used to summarize accuracy. Again we suggest that woman
rather than breast should be the unit of analysis in defining
the ROC curve. That is, in calculating the sensitivity corresponding
to the criterion “rating ≥ K,” it should be defined
as the proportion of women with cancer who have a rating
of ≥ K on an affected side.
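
As a rough illustration of the empirical (nonparametric) construction, the sketch below computes the ROC points for the criteria “rating ≥ K” and the area under the linearly connected curve. The function names are hypothetical, and the inputs are assumed to be woman-level ratings already: the affected-side rating for women with cancer and the maximum rating for women without.

```python
def empirical_roc(diseased, non_diseased, categories=(5, 4, 3, 2)):
    """Return [(false positive rate, true positive rate), ...].

    The list starts at (0, 0), ends at (1, 1), and has one point per
    criterion "rating >= k", strictest criterion first.
    """
    points = [(0.0, 0.0)]
    for k in categories:
        tpr = sum(r >= k for r in diseased) / len(diseased)
        fpr = sum(r >= k for r in non_diseased) / len(non_diseased)
        points.append((fpr, tpr))
    points.append((1.0, 1.0))
    return points


def trapezoid_auc(points):
    """Area under the curve obtained by joining the points linearly."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

The binormal (Dorfman-Alf) fit is not shown; only the empirical method is sketched here.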


FIGURE 1. A hypothetical setting where the sensitivity and
specificity associated with the clinically relevant criteria are
unchanged but the empirical ROC curves indicate a benefit
of intervention. The (false positive, true positive) points associated
with categories 5, 4, 3, and 2 are (0.10, 0.30), (0.25,
0.70), (0.45, 0.85), and (0.75, 0.95), respectively, pre-intervention;
and (0.10, 0.60), (0.25, 0.70), (0.45, 0.85), and
(0.55, 0.95), respectively, post-intervention.


2.2  ROC  Analysis  Versus  Sensitivity  and  Specificity 

ROC analysis was developed originally for diagnostic tests
with results on some arbitrary scale. Its primary advantage
is that it allows one to assess the inherent capacity of the
test to distinguish between diseased and non-diseased subjects
without linking the test to some particular threshold
for defining screen positive [11,12]. This seems appropriate
in radiology experiments when image ratings are arbitrary
numbers with no specific clinical meaning attached to
them. In that case, shifts in the distributions of ratings are
of no consequence as long as they are equally shifted for
diseased and non-diseased subjects. In mammography, however,
mammogram ratings have very specific clinical meanings
and consequent clinical implications. Uniform shifts
in the frequencies with which rating categories are chosen
can have major clinical implications.

Moreover, in contrast to the prototype setting for ROC
analysis, shifts between certain diagnostic categories are of
more importance than others. For example, as noted by Kopans
[13], whether an image is rated in category 4 versus
category 5 has no clinical impact. Similarly, classifications
in category 1 versus category 2 are clinically irrelevant.
However, shifts between categories 4 and 5 and between 1
and 2 can have a big impact on the ROC analysis. To illustrate
this, consider the setting shown in Fig. 1. The effect
of intervention in this setting is to shift classifications of
diseased observations from category 4 to category 5 and classifications
of non-diseased patients from category 2 to category
1. Though these changes are of no clinical import, the
ROC type analysis indicates a benefit for the intervention.
Thus an ROC analysis can indicate a benefit of intervention
even though a clinically relevant benefit does not exist.
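
The point can be checked numerically with the Fig. 1 coordinates. The short sketch below (helper name hypothetical) computes the empirical area under each curve by the trapezoid rule; the point for the criterion “rating ≥ 4” is (0.25, 0.70) both before and after intervention, yet the area rises from about 0.759 to about 0.811.

```python
def trapezoid_auc(points):
    """Empirical AUC: anchor at (0, 0) and (1, 1), join points linearly."""
    pts = [(0.0, 0.0)] + sorted(points) + [(1.0, 1.0)]
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(pts, pts[1:])
    )


# (false positive, true positive) points for categories 5, 4, 3, 2 in Fig. 1
PRE = [(0.10, 0.30), (0.25, 0.70), (0.45, 0.85), (0.75, 0.95)]
POST = [(0.10, 0.60), (0.25, 0.70), (0.45, 0.85), (0.55, 0.95)]

auc_pre = trapezoid_auc(PRE)    # about 0.759
auc_post = trapezoid_auc(POST)  # about 0.811
```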

Of even more concern is the fact that a clinically relevant
benefit of intervention can occur even when the ROC
curves pre- and post-intervention are the same. Consider
the ROC curve depicted in Fig. 2 for such a situation. The
locations on the ROC curve of the points associated with
the criterion “rating ≥ category 4” indicate that sensitivity
was significantly increased without decreasing specificity.
This clinically relevant improvement in test accuracy does
not manifest itself in an improvement in the ROC curves
since the pre- and post-intervention curves are the same.
(Interestingly, classic binormal ROC curves do not fit the
situation depicted in Fig. 2 and a binormal ROC analysis
in this setting may incorrectly indicate that the ROC curve
post-intervention is improved over that pre-intervention.)

The fact that ROC analysis can yield inappropriate conclusions
regarding the clinically relevant effects of intervention
argues against its use for the primary analysis of mammography
reading study data. Another valid argument for
not using an ROC analysis is that it is complicated and
not easily understood by clinicians. Moreover, the so-called
“area under the curve” that summarizes the ROC curve in
a single number has an interpretation that is not well known
or easily understood. It can be interpreted as the probability
that a radiologist will have a greater suspicion of cancer
from a mammogram from a woman with disease than from
a woman without [14]. This probability, however, seems to
be of more theoretical than practical relevance.

FIGURE 2. A hypothetical setting where the ROC curve is unchanged
by the intervention but there is a clinically relevant
benefit. The sensitivity associated with the clinically relevant
criterion is improved from 0.50 to 0.70 while the associated
false positive rate remains unchanged at 0.09. The
(false positive, true positive) points associated with categories
5, 4, 3, and 2 are (0.03, 0.27), (0.09, 0.50), (0.15, 0.83),
and (0.39, 0.93) pre-intervention and (0.03, 0.27), (0.09,
0.70), (0.15, 0.83), and (0.39, 0.93) post-intervention. These
points before intervention are labeled with circles and after
intervention are labeled with triangles.

We propose using the more clinically meaningful quantities
of sensitivity and specificity for the primary data analysis
and employing ROC analysis as a secondary descriptive device.
Though ROC analysis may be statistically more powerful
in some settings, statistical power is of secondary importance
relative to clinical relevance. Any study should be
designed so that it has adequate power to detect changes in
the quantities that are of practical relevance. Hence, we
suggest that power calculations for a mammography reading
study should be based on the ability to detect changes in
sensitivity and specificity rather than on the basis of detecting
changes in ROC curves.

3.  STUDY  DESIGN 

We now describe the basic elements of the design that we
propose for studies evaluating intervention effects on reading
accuracy in mammography. In this prototype design, radiologists
are randomly assigned to intervention and control
groups, with the number in the former being denoted by $R_T$
and the number in the latter denoted by $R_C$. Two image
sets are constructed with $M$ images in each set $S = 1, 2$. In
set $S$, a number $M_S^D$ are from women with disease and this
number may differ between the two sets. Each reader reads
one set of images before the intervention period and one
set after. It is important that the sets before and after intervention
be different since readers may remember, to some
degree, images that they have previously read. Half of the
readers chosen at random in each of the intervention and
control groups read set 1 before intervention and set 2 after
intervention. The other half read them in the opposite order:
set 2 followed by set 1. This cross-over of film sets eliminates
the possibility of systematic bias due to film sets. The
design is balanced in the sense that set 1 is read equally
often before and after the intervention phase in both the
intervention and control groups, and similarly for set 2.
Readers are told the approximate prevalence of diseased images,
i.e., $(M_1^D + M_2^D)/2M$, and that this varies between the
two sets. The rationale for telling the readers the approximate
prevalence is that it will become apparent in any case
after reading the first set of images and that a priori knowledge
of it should reduce, as much as possible, its potential
impact on the observed improvement in accuracy. Readers
will use the ACR lexicon to classify mammograms and for
each reading it will be determined if it is screen positive or
negative according to whether the rating is at least 4 or less
than 4.
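
The randomization and balanced cross-over described above might be sketched as follows. The function name, the dictionary layout, and the use of Python's `random` module are illustrative assumptions; the design itself only requires that group membership and set order be assigned at random, with each order used by half of each group.

```python
import random


def assign_readers(reader_ids, n_intervention, seed=None):
    """Randomize readers to groups and to film-set orderings.

    Returns {reader_id: (group, (first_set, second_set))}. Within each
    group, half of the readers read set 1 then set 2, and the other half
    read set 2 then set 1.
    """
    rng = random.Random(seed)
    ids = list(reader_ids)
    rng.shuffle(ids)  # random order makes both splits below random
    groups = {
        "intervention": ids[:n_intervention],
        "control": ids[n_intervention:],
    }
    plan = {}
    for group, members in groups.items():
        half = len(members) // 2
        for position, reader in enumerate(members):
            order = (1, 2) if position < half else (2, 1)
            plan[reader] = (group, order)
    return plan
```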

Images for inclusion in the study need to be selected so
that average sensitivity and specificity at the baseline assessment
are relatively low. That is, improvements in accuracy
should be possible with the sets of images chosen. If, in the
absence of intervention, all images from women with disease
were easily identified as such, the observed sensitivities pre-
and post-intervention would be close to 1 and a change in
sensitivity would not be identifiable regardless of the actual
effect of intervention. Thus at least some of the diseased
images should be difficult but not impossible to identify as
being from women with disease. Analogous considerations
apply to specificity and the choice of non-diseased images
included in the study.

4.  DATA  ANALYSIS 

Having described the basic elements of the design and the
choice of primary outcomes, we turn now to the strategy
for data analysis. There are two components to the analysis.
The first concerns a comparison of post- versus pre-intervention
reading accuracy among the $R_T$ readers in the intervention
group. The second is the comparison of changes
from pre- to post-intervention between the intervention
and control groups. We first consider the former analysis,
in part because it allows us to define notation most easily.

The purpose of this data analysis is to compare the overall
sensitivity pre-intervention with that post-intervention
and to compare the overall specificity pre-intervention with
that post-intervention. If $S_{r,\mathrm{pre}}$ and $S_{r,\mathrm{post}}$ denote the observed
pre- and post-intervention sensitivities for radiologist $r$,
then the observed change in the overall sensitivity, $\Delta_T(\mathrm{sensitivity})$,
is the average change in sensitivities across radiologists
in the intervention group:

\[
\Delta_T(\mathrm{sensitivity}) = \frac{1}{R_T} \sum_{r=1}^{R_T} \left( S_{r,\mathrm{post}} - S_{r,\mathrm{pre}} \right).
\]

Similarly the observed change in the overall specificity in
the intervention group is

\[
\Delta_T(\mathrm{specificity}) = \frac{1}{R_T} \sum_{r=1}^{R_T} \left( F_{r,\mathrm{post}} - F_{r,\mathrm{pre}} \right),
\]

where $F_{r,\mathrm{pre}}$ and $F_{r,\mathrm{post}}$ denote the observed pre- and post-intervention
specificities for radiologist $r$. Variance estimators
for $\Delta_T(\mathrm{sensitivity})$ and $\Delta_T(\mathrm{specificity})$ are provided in
the appendix. Although $\Delta_T(\mathrm{sensitivity})$ and $\Delta_T(\mathrm{specificity})$
are sample means of changes in sensitivities and specificities,
their variances are not given by the usual variance formulae
for sample means. Indeed such sample variances would overestimate
the variability. Rather the correct variance estimators
rely on acknowledging that there are in essence two strata
of radiologists in the design, which are defined by the ordering
of the two image sets which are rated. The variances of
$\Delta_T(\mathrm{sensitivity})$ and $\Delta_T(\mathrm{specificity})$ are averages of stratum-specific
variances, as shown in Appendix A.

Sensitivity and specificity are highly correlated parameters.
Radiologists with high sensitivities tend to have low
specificities. This will happen, for example, if they have a
low threshold for classifying images as diseased. Similarly,
changes in sensitivities and specificities induced by the intervention
may be highly correlated. In particular, if the
intervention simply changes the implicit threshold a radiologist
has for classifying a mammogram as diseased then the
sensitivity and specificity will both be changed, but in opposite
directions. Thus it is important to assess joint effects of
intervention on sensitivity and specificity and to account
for correlations between them in making inference. This
can be accomplished by employing a bivariate analysis approach,
which is a special case of multivariate analysis, and
for which there is a large statistical literature [15]. Using
this approach to test the hypothesis that the true average
sensitivity and specificity are unchanged by the intervention,
$H_0$: $\Delta_T(\mathrm{sensitivity}) = \Delta_T(\mathrm{specificity}) = 0$, a chi-square
test statistic is calculated. This statistic is a function of the
observed average changes, $\Delta_T(\mathrm{sensitivity})$ and $\Delta_T(\mathrm{specificity})$,
their variances and also their correlation. An expression
for the chi-squared statistic is provided in the Appendix.
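
A minimal sketch of such a bivariate test is given below. The appendix expressions are not reproduced in this section, so the covariance estimate used here is an assumption: a standard stratified estimator in which the two strata are the two film-set orderings. The function names are hypothetical. Each input row is one radiologist's pair of changes (change in sensitivity, change in specificity).

```python
import math


def mean_vec(rows):
    n = len(rows)
    return sum(x for x, _ in rows) / n, sum(y for _, y in rows) / n


def cov_terms(rows):
    """Sample variances and covariance (divisor n - 1) of paired changes."""
    n = len(rows)
    mx, my = mean_vec(rows)
    sxx = sum((x - mx) ** 2 for x, _ in rows) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in rows) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    return sxx, sxy, syy


def bivariate_chi_square(stratum_a, stratum_b):
    """Chi-square statistic (2 df) and p-value for H0: both mean changes are 0.

    ASSUMPTION: the covariance of the mean change vector is estimated by
    combining within-stratum covariances, weighted by squared stratum shares.
    """
    rows = list(stratum_a) + list(stratum_b)
    total = len(rows)
    d_sens, d_spec = mean_vec(rows)
    vxx = vxy = vyy = 0.0
    for stratum in (stratum_a, stratum_b):
        weight = (len(stratum) / total) ** 2 / len(stratum)
        sxx, sxy, syy = cov_terms(stratum)
        vxx += weight * sxx
        vxy += weight * sxy
        vyy += weight * syy
    det = vxx * vyy - vxy ** 2
    # Quadratic form d' V^{-1} d written out for the 2 x 2 case.
    chi2 = (vyy * d_sens ** 2 - 2 * vxy * d_sens * d_spec
            + vxx * d_spec ** 2) / det
    p_value = math.exp(-chi2 / 2.0)  # chi-square upper tail with 2 df
    return chi2, p_value
```

The closed form `exp(-chi2 / 2)` is the exact upper-tail probability of a chi-square distribution with 2 degrees of freedom, which avoids any dependence on a statistics library.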

In addition to simply testing the hypothesis of no intervention
effect, it will be important to provide a confidence
region for the intervention effects on sensitivity and specificity
based on the observed data. That is, a range of intervention
effects, $\{\Delta_T(\mathrm{sensitivity}), \Delta_T(\mathrm{specificity})\}$, which are
consistent with the observed data. Such a joint 95% confidence
region is defined formally as the set of values $(x, y)$
for which the hypothesis $H_0$: $\{\Delta_T(\mathrm{sensitivity}) = x, \Delta_T(\mathrm{specificity}) = y\}$
is not rejected at the 5% significance level.
This region is an ellipse, centered at the observed intervention
effect $(\Delta_T(\mathrm{sensitivity}), \Delta_T(\mathrm{specificity}))$. We refer the
interested reader to the text [15] by Johnson and Wichern
(1988, section 5.2) for technical details regarding its calculation.
Code for calculating such regions has been written
by Murdoch and Chow for the S-PLUS statistical software
package and can be obtained from the S-archive on the
Statlib computer site (http://lib.stat.cmu.edu). In a similar
fashion a joint confidence region for the overall average
sensitivity and specificity pre- or post-intervention can be
calculated. It is calculated using the observed radiologist-specific
sensitivities and specificities pre- and post-intervention,
and requires only calculation of the means, variances
and correlations for these parameters. To illustrate these
analyses, Fig. 3 displays joint confidence regions based on
a simulated data set. In our opinion these confidence regions
provide a simple summary of the information contained in
study data regarding intervention effects on reading accuracy.
In the simulated data, the analyses show that sensitivity
was increased by the intervention whereas there is no
evidence of change in specificity.
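
Membership in such an elliptical region can be checked directly from the definition: a candidate pair (x, y) belongs to the region when the corresponding test does not reject. The sketch below assumes a large-sample chi-square critical value with 2 degrees of freedom (small-sample versions use an F-based constant instead); the function name and argument layout are hypothetical.

```python
import math


def in_confidence_region(observed, cov, candidate, level=0.95):
    """True if `candidate` lies inside the joint confidence ellipse.

    observed  -- (change in sensitivity, change in specificity)
    cov       -- 2 x 2 covariance matrix of the observed changes,
                 as ((vxx, vxy), (vxy, vyy))
    candidate -- hypothesized pair (x, y)
    """
    (vxx, vxy), (_, vyy) = cov
    dx = observed[0] - candidate[0]
    dy = observed[1] - candidate[1]
    det = vxx * vyy - vxy ** 2
    distance = (vyy * dx * dx - 2 * vxy * dx * dy + vxx * dy * dy) / det
    # The chi-square quantile with 2 df has the closed form -2 ln(1 - level).
    return distance <= -2.0 * math.log(1.0 - level)
```

With this definition, the intervention effect is significant at the 5% level exactly when the candidate (0, 0) falls outside the 95% region.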

So far we have considered the comparison of post- versus
pre-intervention reading accuracy within the intervention
group. To attribute changes in accuracy to the intervention
it will be necessary to compare the changes in the intervention
group with those in the control group. Without the
control group comparison, observed changes might be attributed
to other factors, such as the increased reading practice
or increased awareness of reader fallibility induced by
participation in the study. Thus, turning now to the comparison
of intervention and control groups, the main hypothesis
to be tested is that the changes in sensitivity and
specificity in the intervention group are the same as those
in the control group. Using a subscript $T$ to denote the intervention
group and subscript $C$ to denote the control
group, the null hypothesis is $H_0$: $\Delta_C(\mathrm{sensitivity}) = \Delta_T(\mathrm{sensitivity})$,
$\Delta_C(\mathrm{specificity}) = \Delta_T(\mathrm{specificity})$. A test statistic
that has a chi-square distribution with 2 degrees of freedom
is described in the appendix for testing this hypothesis. Joint
confidence regions for the differences in changes between
the groups, namely $\Delta_T(\mathrm{sensitivity}) - \Delta_C(\mathrm{sensitivity})$ and
$\Delta_T(\mathrm{specificity}) - \Delta_C(\mathrm{specificity})$, can be calculated using
methods analogous to those described earlier for the pre-
versus post-intervention comparison.

5. METHODOLOGY FOR POWER CALCULATIONS

Power calculations for the reading study are somewhat complicated.
They must accommodate the facts that readers
vary in their accuracy parameters of sensitivity and specificity,
that their sensitivities and specificities are likely negatively
correlated, that images vary in difficulty, and that a
bivariate analysis approach will be employed. These factors
together make analytic expressions for sample size intractable.
We instead take a computer simulation approach to
power calculations. The simulation approach to power calculation
is a general and standard method and indeed software
has been developed for certain types of applications
[16]. The basic idea is to repeatedly simulate data as it is
expected or hoped to arise in the course of the study, and
determine how often the null hypothesis is rejected. By
definition the statistical power of the study is the proportion
of simulated studies in which the null hypothesis is rejected.
One calculates the power in this fashion using various sample
sizes until a sample size is found that provides adequate
power. This indirect computer-intensive approach to sample
size calculation is easily accomplished with modern computers.

FIGURE 3. Joint confidence regions for sensitivity and specificity
both pre- and post-intervention (upper panel) along
with a joint confidence region (lower panel) for the changes
in these parameters. Data used in this illustration were generated
using computer simulation methods described in sections
5 and 6. Points correspond to observed data for individual
radiologists.


5.1 Models for Pre- and Post-intervention Accuracy

To simulate study data we need to define precisely the
mechanisms giving rise to the data. We therefore need to
make assumptions about the reading accuracies before and
after intervention. For this purpose we suppose that before
intervention a reader correctly assesses a woman with tumor
as being diseased with probability $P^D_{ir}$. The probability $P^D_{ir}$
depends on the image, denoted by $i$, and on the reader, denoted
by $r$. The probabilities $P^D_{ir}$ will presumably be higher
if the tumor is clearly visible in image $i$ than if it is not.
The probabilities will also be higher if the radiologist is conservative
and is inclined to recommend biopsy for borderline
cases. We let $S^D$ be the sensitivity of the average radiologist
to the average film from a woman with tumor. The
variability among films, in terms of the difficulty that readers
have in assessing them, is captured by specifying a distribution
for the sensitivities that the average reader has in assessing
the films. Here we assume that the average reader’s
sensitivity to films varies uniformly in an interval $(S^D - a^D, S^D + a^D)$
across different films. Thus for the average radiologist,
easier films are read with sensitivity closer to $S^D + a^D$
and more difficult films are read with sensitivity closer to
$S^D - a^D$. In a similar fashion, on the average film from a
diseased woman, the sensitivity of different readers is assumed
to vary uniformly in an interval $(S^D - b^D, S^D + b^D)$
across radiologists. Thus radiologists with high sensitivity
to the average film will have sensitivity closer to $S^D + b^D$.

In the appendix we detail a logistic model with random effects
(also called a mixed model) for the probabilities $P^D_{ir}$
that gives rise to inter-image and inter-reader variability as
postulated here. It is assumed that on the logistic scale there
are no interactions between reader and image specific effects
on the sensitivity.

Observe that for the purposes of simulating data, by specifying
$S^D$ and $a^D$ we can now generate a random image effect
by choosing a random number in $(S^D - a^D, S^D + a^D)$ that corresponds
to the sensitivity an average radiologist has for detecting it.
Similarly, having specified $S^D$ and $b^D$ we are in a position
to generate a random reader effect by choosing a random
number in $(S^D - b^D, S^D + b^D)$ that corresponds to his sensitivity
to the average film. The logistic model displayed in
the appendix then yields the probability $P^D_{ir}$ that that reader
has of correctly assessing that image as diseased.
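
To make the mechanism concrete, the sketch below draws an image effect and a reader effect uniformly and combines them on the logistic scale. The appendix model is not reproduced in this section, so the additive-logit form used in `detection_probability` is an assumption: it is one simple model with no reader-by-image interaction on the logit scale, chosen so that the average image returns the reader's own sensitivity. All function names are hypothetical.

```python
import math
import random


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def draw_uniform(center, half_width, rng):
    """Random effect drawn uniformly in (center - half_width, center + half_width)."""
    return rng.uniform(center - half_width, center + half_width)


def detection_probability(image_sens, reader_sens, s_d):
    """ASSUMED form: logit(P) = logit(image) + logit(reader) - logit(S^D).

    image_sens  -- sensitivity of the average reader to this image
    reader_sens -- sensitivity of this reader to the average image
    s_d         -- sensitivity of the average reader to the average image
    """
    return expit(logit(image_sens) + logit(reader_sens) - logit(s_d))


rng = random.Random(0)
image_sens = draw_uniform(0.75, 0.20, rng)   # S^D = 0.75, a^D = 0.20
reader_sens = draw_uniform(0.75, 0.20, rng)  # S^D = 0.75, b^D = 0.20
p_ir = detection_probability(image_sens, reader_sens, 0.75)
```

The intervals must stay strictly inside (0, 1) for the logit to be defined, which holds for the parameter values considered in this paper.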

Analogous considerations apply to the determination of
randomly generated specificities which vary across radiologists
and across images from women without disease. Values
for parameters $F^{\bar{D}}$, $b^{\bar{D}}$ and $a^{\bar{D}}$ need to be specified in order
to define the data generating process. Here, $F^{\bar{D}}$ is the probability
that the average radiologist will correctly assess the
average non-diseased image as such, radiologists vary uniformly
in $(F^{\bar{D}} - b^{\bar{D}}, F^{\bar{D}} + b^{\bar{D}})$ in their specificities to the
average non-diseased film, and images from women without
disease vary uniformly in $(F^{\bar{D}} - a^{\bar{D}}, F^{\bar{D}} + a^{\bar{D}})$ in the probabilities
of the average reader correctly classifying them. The
sensitivities and specificities from single radiologists should
be correlated. In the Appendix we describe how negative
correlation between sensitivities and specificities within radiologists
can be built into the data simulation mechanism.

In summary, for each study radiologist we simulate his/her
sensitivity and specificity to the average diseased and
non-diseased films, respectively, by randomly sampling correlated
numbers from $(S^D - b^D, S^D + b^D)$ and $(F^{\bar{D}} - b^{\bar{D}}, F^{\bar{D}} + b^{\bar{D}})$,
respectively. For each study film we determine the
sensitivity or specificity that an average radiologist has for
it by randomly sampling a number from $(S^D - a^D, S^D + a^D)$
or $(F^{\bar{D}} - a^{\bar{D}}, F^{\bar{D}} + a^{\bar{D}})$. Finally, for each combination of film
$i$ and radiologist $r$, we can calculate $P^D_{ir}$ or $P^{\bar{D}}_{ir}$, which is the
probability that the radiologist will assess that image correctly.

The $P^D_{ir}$ and $P^{\bar{D}}_{ir}$ pertain to probabilities before intervention
in the treatment and control groups. One also needs
to specify treatment effects in order that corresponding
probabilities after intervention can be calculated. We postulate
that after intervention the quantities $S^D$ and $F^{\bar{D}}$ are
changed to new values but that the variation among readers
and among images remains the same. In the Appendix
we define in a mathematically precise way a logistic model
that incorporates such intervention effects.

5.2 Simulated Study Data Generation

Having specified statistical models for pre- and post-intervention
rating probabilities that incorporate variation
among radiologists and among images, we now turn to the
simulation of study data in accordance with the study design
that we proposed in section 3. The first step is to generate
images and image sets. This entails generating $M$ diseased
images (i.e., $M$ image-specific parameters, one for each image),
generating $M$ non-diseased images, and finally from
the $2M$ films choosing $M$ at random without replacement
to form film set 1. The remaining $M$ films constitute film
set 2. The next step is to generate $R_T$ intervention readers
and $R_C$ control readers and assign them film sets. That is,
for each of $R_T + R_C$ readers we generate pairs of pre- and
post-intervention sensitivities and specificities to average
diseased and non-diseased films according to the models described
in section 5.1. Of the total $R_T + R_C$ readers, $R_T$ are
assigned at random to the intervention group and the remaining
$R_C$ to the control group. Finally film set orderings
are assigned to the readers with half of the intervention
readers selected at random being assigned set 1 first and the
other half assigned set 2 first. Similarly, $R_C/2$ control readers
are assigned set 1 followed by set 2 and the other $R_C/2$ readers
are assigned film sets in the opposite order.

The final step in generating data for a simulated study is
to actually generate the readings for each reader and image
combination. That is, for each reader and for each of the
$M$ films in his/her pre-intervention set, a binary random
variable is generated which is his/her assessment of whether
or not that image shows disease, using the probability
$P^D_{ir,\mathrm{pre}}$ if the image is diseased and $1 - P^{\bar{D}}_{ir,\mathrm{pre}}$ if the image is
not diseased. Similarly, for each of the $M$ films in his/her
post-intervention set a similar binary random variable is
generated using $P^D_{ir,\mathrm{post}}$ or $1 - P^{\bar{D}}_{ir,\mathrm{post}}$, noting that the pre- and
post-probabilities differ by different amounts for
intervention-versus-control radiologists.
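
The reading-generation step can be sketched as below. The data layout (a list of `(is_diseased, p_correct)` pairs for one reader and one film set) and the function name are assumptions for illustration. Drawing "correct" with probability `p_correct` is equivalent to drawing a positive call with the probability of correct assessment for a diseased film and with one minus that probability for a non-diseased film.

```python
import random


def simulate_reader(films, rng):
    """Simulate one reader on one film set.

    films -- list of (is_diseased, p_correct), where p_correct is the
             probability that this reader assesses this film correctly
    Returns the observed (sensitivity, specificity) for the set.
    """
    hits = diseased = 0
    correct_negatives = healthy = 0
    for is_diseased, p_correct in films:
        correct = rng.random() < p_correct  # one binary reading
        if is_diseased:
            diseased += 1
            hits += correct
        else:
            healthy += 1
            correct_negatives += correct
    return hits / diseased, correct_negatives / healthy
```

Calling this once with the pre-intervention probabilities and once with the post-intervention probabilities gives the pair of observed accuracies needed for each reader.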

Having generated the simulated study data, the test statistics
of interest can now be calculated. Data are simulated
(first the probabilities, then the ratings) and results calculated
under the same assumptions and study design many
times, with 1000 or 5000 simulated datasets being typical
numbers used for power calculations. The proportion of simulated
studies in which the null hypothesis is rejected is
the calculated study power for that design and under those
assumptions.
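
The outer loop of such a power calculation is short. In the sketch below, `simulate_and_test` is a hypothetical callable that generates one simulated study with the supplied random number generator and returns whether the null hypothesis was rejected; the toy stand-in only exercises the loop and is not the study model.

```python
import random


def estimate_power(simulate_and_test, n_sims=1000, seed=0):
    """Proportion of simulated studies in which the null is rejected.

    simulate_and_test -- callable taking a random.Random instance and
                         returning True when the simulated study rejects
    """
    rng = random.Random(seed)
    rejections = sum(bool(simulate_and_test(rng)) for _ in range(n_sims))
    return rejections / n_sims


def toy_study(rng):
    """Stand-in for a real simulated study: rejects 80% of the time."""
    return rng.random() < 0.80


power = estimate_power(toy_study, n_sims=5000)
```

In a full calculation this function would be called once per candidate design (number of readers, films per set, assumed baseline accuracies) to find a sample size that gives adequate power.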


6. POWER CALCULATIONS: RESULTS FOR THE MQIP STUDY

To fix ideas, we now illustrate the computer simulation
method for power calculations in the MQIP study. This illustration
also identifies some sources of data to guide assumptions
for power calculations.

We need to choose assumed parameters for the baseline
sensitivities and specificities, for the variations among radiologists
and among images and for intervention effects of
interest. We assume that the median sensitivity pre-intervention,
$S^D$, in our study will be in the range of 0.70 to
0.80. This accords with previous studies that found median
sensitivities of 0.70 and 0.80 [3,4]. Median pre-intervention
specificity will also be assumed to lie in the range of 0.70
to 0.80. Beam et al. [4] found a median specificity of 0.94
for mammograms from women with normal mammograms
and a median specificity of 0.60 for mammograms from
women with benign disease. Elmore et al. [3] found a median
specificity of 0.94. In contrast to these studies, we will inform
the radiologists of the average prevalence, which is
higher than that expected in a practical screening setting.
Because of this and the fact that the films in our study will
be somewhat difficult, we anticipate an initial specificity
lower than observed in those studies. The variation amongst
radiologists in sensitivities and specificities will be assumed
such that $b^D = 0.20$ and $b^{\bar{D}} = 0.20$, which is in agreement
with the range of approximately 40% in sensitivities (and
specificities) among radiologists observed in Beam’s study.
We could find no data on inter-image variability to suggest
appropriate values for $a^D$ and $a^{\bar{D}}$. We assume that they are
of the same order of magnitude as the inter-rater variability
parameters, $a^D = a^{\bar{D}} = 0.20$. With regard to intervention
effects of interest, we consider that changes of 10 percentage
points in either sensitivity or specificity are of interest.
However, we calculated power for a variety of intervention
effects.

TABLE 1. Power to detect a 10% increase in sensitivity and no effect on specificity in the intervention group

                     Design                                       Power
Readers     Films     Pre-intervention  Pre-intervention  Within         Comparison
per group   per set   sensitivity       specificity       intervention   with control
($R_T$)     (M)                                           group          group

20          30        0.70              0.70              0.70           0.38
20          30        0.70              0.80              0.66           0.34
20          30        0.80              0.70              0.79           0.45
20          30        0.80              0.80              0.77           0.44
20          45        0.70              0.70              0.81           0.48
20          45        0.70              0.80              0.82           0.53
20          45        0.80              0.70              0.91           0.61
20          45        0.80              0.80              0.92           0.64
30          30        0.70              0.70              0.81           0.48
30          30        0.70              0.80              0.83           0.52
30          30        0.80              0.70              0.93           0.60
30          30        0.80              0.80              0.91           0.61
30          45        0.70              0.70              0.94           0.66
30          45        0.70              0.80              0.95           0.66
30          45        0.80              0.70              0.99           0.80
30          45        0.80              0.80              0.99           0.79
40          30        0.70              0.70              0.92           0.61
40          30        0.70              0.80              0.94           0.60
40          30        0.80              0.70              0.97           0.73
40          30        0.80              0.80              0.98           0.75
40          45        0.70              0.70              0.98           0.79
40          45        0.70              0.80              0.99           0.80
40          45        0.80              0.70              0.99           0.88
40          45        0.80              0.80              0.99           0.89

All tests are two sided and are tested at a significance level of 0.05.

Practical considerations of time and cost dictate the range
of sample sizes that are feasible, and therefore the range for
which power calculations are performed. We anticipate that
no more than approximately 80 radiologists are available
for the reading study in the rural communities in which our
mammography quality improvement study is being conducted.
To maximize power, equal numbers of radiologists
are assigned to control and intervention groups. Therefore
the number of radiologists per group considered for
power calculation purposes is in the range of 20-40.
Experience suggests that readers can comfortably read no
more than 45 films per session. We therefore calculated
power for experiments in which the number of films per set,
M, was either 30 or 45.

Estimates of power based on computer simulations are
shown in Table 1. Though results are shown only for
intervention effects on sensitivity with no effect on specificity,
because of the symmetry inherent in the design, the same
power calculations hold for a 10 percentage point change in
specificity with no change in sensitivity. Observe that the
power is far larger for the within intervention group assessment
of change than for the between group comparison of change.
This is to be expected since the variability involved in
comparing two random changes is greater than the variability
involved in comparing a single change with the null
hypothesis of no change. We also observe from Table 1 that
the power is less when the baseline sensitivity is 0.70 than
when it is 0.80. This is due to the relatively larger binomial
variance for the lower baseline rate. To be conservative we
focus on this lower rate. Interestingly, the baseline specificity
had little impact on the power to detect an intervention
effect on the sensitivity.

The  target  power  for  our  study  design  is  90%,  which 
allows  a  10%  chance  of  an  inconclusive  result  when  the 
intervention  increases  sensitivity  from  0.70  to  0.80.  For  the 
within  intervention  group  comparison  this  cannot  be 
achieved  with  20  readers,  but  it  can  be  achieved  with  30 
readers  if  45  images  are  included  in  each  image  set.  The 
between  group  comparison,  however,  has  a  power  of  only 
66%  in  this  case.  Even  with  use  of  our  maximum  resources, 
i.e.,  40  readers  per  group  and  45  images  per  reading  set,  the 
power  is  only  80%.  This  allows  for  a  20%  chance  of  an 
inconclusive result even when there is a clinically important
intervention effect on diagnostic accuracy.

For the MQIP study we chose not to include a control
group in the reading study component, but instead to focus
the study on the within group comparison. The power
calculations were an important contribution to this decision
but other considerations also played a role. Radiologists
would have little motivation to participate in the control
arm, whereas they would receive continuing medical education
(CME) credit for participation in the intervention arm.
The possibility that those in the control arm would learn
from the baseline assessment was also a concern, and thus
we were concerned that it might not even be feasible to
construct a true control group. Finally, it was felt that if we
found a definite positive change in the intervention group,
then this would provide sufficient motivation to proceed
with more comprehensive controlled studies in the future.
Thus we chose to study only the intervention effects in the
intervention group and to use sample sizes of 30 radiologists,
each reading sets of mammograms from 45 women before
and after intervention.

TABLE 2. Study power to detect various configurations of
changes in the intervention group using a study design with
30 readers and 45 films per set

Pre-intervention
sensitivity        Δ_T(sens)   Δ_T(spec)   Power

0.60               +0.10       0.00        0.90
0.70               +0.10       0.00        0.95
0.80               +0.10       0.00        0.98
0.60               +0.05       0.00        0.35
0.70               +0.05       0.00        0.39
0.80               +0.05       0.00        0.50
0.60               +0.05       +0.05       0.66
0.70               +0.05       +0.05       0.68
0.80               +0.05       +0.05       0.71

The pre-intervention specificity is assumed to be 0.70 in all cases. The
intervention-induced change in sensitivity is denoted Δ_T(sens) and the
change in specificity is denoted Δ_T(spec).

The simulation program allowed us the flexibility to explore
the performance of this study design in a variety of
settings other than that assumed for the primary sample size
calculation. First we calculated the probability of rejecting
the null hypothesis for settings where there was no
intervention effect. Recall that inference for the test statistic is
based on a chi-square statistic and is theoretically valid with
large samples. However, this study entails relatively small
samples. We used the simulations to check the adequacy of
the large sample theory in our study. To do this we generated
data under the null hypothesis. The rejection probability
was approximately 0.06 in the settings we studied,
indicating that the true significance level of the test is slightly
higher than the target of 0.05 but adequate for our purposes.

We next explored the power of this study design and sample
sizes to detect an array of intervention effects. Results
are shown in Table 2. Although the study has adequate
power to detect a change in sensitivity (or specificity) of
0.10 even when the pre-intervention sensitivity is as low
as 0.60, it has little chance of detecting a smaller change
of 0.05. On the other hand, if small changes of the order of
0.05 occur in both the average sensitivity and in the average
specificity there is a good chance that the simultaneous
effects will be detected.

7.  DISCUSSION 

Diagnostic imaging technology is already a basic component
of medical care and continues to develop at a rapid pace.
It is clearly important to assess the accuracy with which
readers can diagnose disease using such technologies, to
evaluate the effects of training strategies and to compare
methods. Implications for public health can be enormous.
Unfortunately, statistical methodology for evaluating and
comparing imaging methods has not received much attention
from biostatisticians and epidemiologists involved in public
health research. Rather the literature is concentrated in
radiology research journals, has generally focused on small
scale studies involving only a few readers and has ignored
clinical implications associated with different diagnostic
categories. We believe that it is time to bring the discussion
about study design and analysis for evaluating imaging
technology to the broader community of epidemiologists and
statisticians involved in public health. This is particularly
important as interest increases in the accuracies and costs
of these imaging methods. By presenting our thoughts on
the design and analysis of a study to evaluate an educational
intervention on the interpretation of mammograms, we
hope to stimulate such discussion.

The choice of primary outcome measure is the most basic
element of any study design. We chose to consider the
sensitivity and specificity as the basis for evaluating intervention
effects. This choice conflicts with the view of the initial
statistical reviewers of our study design, who were of the
opinion that ROC analysis was the only appropriate and
indeed the state-of-the-art basis for evaluating an intervention
effect. We now argue that in mammography, where specific
clinical actions are associated with diagnostic rating
categories, sensitivity and specificity provide a more clinically
relevant and conceptually straightforward basis for comparison
than does ROC analysis. Moreover this approach allows us
to evaluate effects on false positive as well as true positive
rates. In contrast ROC analysis does not quantify the false
positive rate directly but in a sense only uses it to standardize
the true positive rate. We do not dismiss ROC analysis
entirely but rather we regard the analysis of the specific rating
categories as of secondary importance and focus the design on
sensitivity and specificity. Thus the MQIP study was designed
to ensure adequate power to detect changes in the most
clinically relevant quantities.

We also needed to decide upon the analysis techniques
for making statistical inference about sensitivity and
specificity. We propose to simultaneously estimate sensitivity
and specificity using multivariate methods. Sensitivity and
specificity as we have defined them are average sensitivities
and average specificities of radiologists in our study. They
can also be interpreted as marginal or population average
quantities, in the sense of being the probability that a
diseased (or non-diseased) image will be correctly interpreted
as such in the study. The distinction between the population
average and average radiologist-specific interpretations
has to do with whether one considers the accuracy
parameters to be based on data pooled across radiologists
(population average) or to be based on calculation of the
accuracy parameter for each radiologist and then averaging the
results. In our study these quantities coincide because all
radiologists are expected to read the same numbers of films.
In studies where this is not the case, the distinction should be
considered and a decision should be made regarding which of
the two entities is most relevant.
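
The distinction between the two averages can be illustrated with a small numerical sketch. The function names and the counts below are hypothetical and chosen only for illustration; they are not data from the study.

```python
def pooled_sensitivity(counts):
    """Population-average sensitivity: pool readings across radiologists.

    counts is a list of (true_positives, diseased_films_read) pairs,
    one pair per radiologist (hypothetical illustration data).
    """
    true_pos = sum(tp for tp, n in counts)
    total = sum(n for tp, n in counts)
    return true_pos / total


def mean_reader_sensitivity(counts):
    """Average of the radiologist-specific sensitivities."""
    return sum(tp / n for tp, n in counts) / len(counts)


# Every radiologist reads 15 diseased films: the two summaries agree.
equal_load = [(12, 15), (9, 15), (15, 15)]

# One radiologist reads 30 diseased films: the two summaries differ.
unequal_load = [(12, 15), (9, 30), (15, 15)]

if __name__ == "__main__":
    for label, data in (("equal", equal_load), ("unequal", unequal_load)):
        print(label, pooled_sensitivity(data), mean_reader_sensitivity(data))
```

With equal reading loads each radiologist carries the same weight in both summaries, which is why they coincide in a design like ours.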

The approach we propose for statistical inference is
relatively straightforward, being based on methods for inference
about sample means. Confidence intervals are based on the
variance-covariance matrix of the estimated (sensitivity,
specificity) parameters or their changes among radiologists.
Possible non-normality of the average estimates may
be an issue in our study, though for the settings considered
in the power calculation this did not appear to be the case.
An alternative approach to inference which might be more
robust would follow the marginal regression modeling
approach described by Leisenring, Pepe, and Longton [17].
One could formulate logistic regression models for the
population average sensitivity and 1 - specificity as

$$\operatorname{logit}\{\operatorname{Prob}[\text{screen positive} \mid \text{image diseased}]\} = \gamma_0 + \gamma_1 b$$

$$\operatorname{logit}\{\operatorname{Prob}[\text{screen positive} \mid \text{image non-diseased}]\} = \eta_0 + \eta_1 b$$

where the logit function is $\operatorname{logit}\{x\} = \ln\{x/(1 - x)\}$ and
b is 0 if the image was read before the intervention and 1
if it was read after the intervention. The changes in the
true and false positive rates are now quantified in the log odds
ratio parameters $\gamma_1$ and $\eta_1$, respectively, and joint
confidence intervals can be calculated. By adding an interaction
term between b and I, where I is an indicator of the
radiologist being in the control or intervention group:

$$\operatorname{logit}\{\operatorname{Prob}[\text{screen positive} \mid \text{image diseased}]\} = \gamma_0 + \gamma_1 b + \gamma_2 bI$$

$$\operatorname{logit}\{\operatorname{Prob}[\text{screen positive} \mid \text{image non-diseased}]\} = \eta_0 + \eta_1 b + \eta_2 bI$$

a comparison of the changes in the intervention and control
groups can be made by testing if the parameters $\gamma_2$ or $\eta_2$ are
0. Though this logistic regression modeling approach may
provide more robust confidence intervals, we felt that the
simpler approach described earlier was adequate for power
calculations.
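
As a small worked illustration of how the coefficients of such a logistic model relate to the accuracy parameters, the sketch below converts a pre-intervention sensitivity of 0.70 and a post-intervention sensitivity of 0.80 (the primary design scenario) into an intercept, a slope for the indicator b, and an odds ratio. The variable names are ours and purely illustrative.

```python
import math


def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def expit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))


sens_pre, sens_post = 0.70, 0.80

# Intercept: log odds of a positive reading on a diseased image before intervention.
gamma0 = logit(sens_pre)

# Coefficient of b: change in log odds after the intervention.
gamma1 = logit(sens_post) - logit(sens_pre)

# Odds ratio associated with the intervention for diseased images.
odds_ratio = math.exp(gamma1)

if __name__ == "__main__":
    print("gamma0 =", gamma0, "gamma1 =", gamma1, "odds ratio =", odds_ratio)
    print("recovered post-intervention sensitivity =", expit(gamma0 + gamma1))
```

The odds ratio here equals (0.80/0.20)/(0.70/0.30) = 12/7, so a 10 percentage point gain in sensitivity corresponds to roughly a 1.7-fold increase in the odds of a true positive reading.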


The prototype reading study we have described concerns
evaluating the effect of an intervention on the change in
accuracy parameters. We note, however, that most of our
discussion is also relevant to the comparison of accuracies
associated with different imaging modalities. Suppose, for
example, that there are two sets of women (denoted by set
1 and set 2) from which images have been made using two
modalities. A natural study design to compare the modalities
would entail readers assigned to read one set of films
produced with one modality and the other set of films
produced with the other modality. Using the notation 1(A) to
denote set 1 produced with modality A and similarly for the
other combinations, readers read either {1(A) and 2(B)} or
{2(A) and 1(B)}. Considering that the ordering may also
influence accuracy parameters, this yields four groups of
readings, {1(A), 2(B)}, {2(B), 1(A)}, {2(A), 1(B)} and
{1(B), 2(A)}. A balanced cross-over design would assign
radiologists randomly to these four reading assignments.
The difference in the sensitivity and specificity between
modality A and B can be calculated by simply pooling all
relevant readings for modality A and similarly for modality
B. Inference for the difference follows in the same fashion
as that described for the change induced by intervention in
the intervention group of our study, except that now there are
4 rather than 2 strata of radiologists defined by the image
reading set assignments.
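
One way to carry out the random assignment for such a balanced cross-over design is sketched below. This is an illustration under our own assumptions (reader identifiers, a shuffle followed by cycling through the four reading orders), not a description of software used in the study.

```python
import random
from collections import Counter

# The four reading assignments: (first reading, second reading).
ASSIGNMENTS = [
    ("1(A)", "2(B)"),
    ("2(B)", "1(A)"),
    ("2(A)", "1(B)"),
    ("1(B)", "2(A)"),
]


def assign_balanced(reader_ids, seed=None):
    """Randomly allocate readers to the four assignments.

    Readers are shuffled and then dealt out in rotation, so group sizes
    differ by at most one (and are equal when the count is a multiple of 4).
    """
    rng = random.Random(seed)
    ids = list(reader_ids)
    rng.shuffle(ids)
    return {rid: ASSIGNMENTS[k % len(ASSIGNMENTS)] for k, rid in enumerate(ids)}


def group_sizes(plan):
    """Number of readers allocated to each reading assignment."""
    return Counter(plan.values())


if __name__ == "__main__":
    plan = assign_balanced(range(1, 13), seed=2024)
    for reader, order in sorted(plan.items()):
        print(reader, order)
```

The four assignment groups then play the role of the strata in the variance calculation.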

Power calculations for reading studies are not straightforward
due in part to correlations induced by images and readers.
That is, for each image there are multiple readings.
Moreover, each reader provides multiple readings and
radiologist-specific sensitivities and specificities are correlated.
We propose simple analyses for dealing with these factors
but power calculations required a computer simulation
approach. We found the process of developing the computer
simulation study to be a useful exercise. It compels one to
think through the processes generating study data. It also
allows one to experiment with the assumptions and design
easily. For example, we considered designs that included a
larger number of film sets to be read in the study and found
that the study power was decreased slightly due to the extra
variation introduced. Computer simulations also allow one
to check how test statistics perform under the null hypothesis
with sample sizes proposed in the study. Hence one can
check if inference based on large sample theory is valid in
the setting where it is to be applied. We suggest that
simulation studies are a useful approach to power calculations in
any setting, though given the complexities in radiology
reading studies, the case for the technique in this setting is
particularly strong.

We appreciate the support of grants GM54438 and CA63146 awarded
by the National Institutes of Health, and grant DAMD17-96-1-6288
awarded by the Department of Defense. We thank Molly Edmonds for
her excellent technical help in preparing the manuscript.
APPENDIX A

1.  VARIANCE  ESTIMATORS  FOR  CHANGE  IN 
OVERALL  SENSITIVITY  AND  SPECIFICITY 

The change in the overall sensitivity defined in Section 4 can be
written formally as

$$\Delta_T(\text{sensitivity}) = \frac{1}{R_T}\left\{ \sum_{(\text{order}=1,2)} \left(S_{r,\text{post}} - S_{r,\text{pre}}\right) + \sum_{(\text{order}=2,1)} \left(S_{r,\text{post}} - S_{r,\text{pre}}\right) \right\}$$

where $S_{r,\text{pre}}$ is the observed sensitivity for radiologist r with his
pre-intervention film set and $S_{r,\text{post}}$ is the corresponding quantity
post-intervention. Observe that the order of film sets essentially defines
two strata in this setting and the notation (order = 1,2) (or [order
= 2,1]) used to denote the stratum in the summation indicates
that it includes only radiologists assigned sets in the order set 1
first and set 2 second (or set 2 first and set 1 second). The variance
of $\Delta_T(\text{sensitivity})$ can be estimated using the variance of a stratified
sample mean $V = 0.5\,(V^{(1,2)} + V^{(2,1)})/R_T$, where $V^{(1,2)}$ is the sample
variance of the quantities $(S_{r,\text{post}} - S_{r,\text{pre}})$ in the stratum (order =
1,2), and $V^{(2,1)}$ is the analogous quantity in the other stratum. The
ratio $\Delta_T(\text{sensitivity})/\sqrt{V}$ can be compared with a standard normal
distribution to test for a change in the sensitivity which is
statistically significantly different from 0.
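
The calculation just described can be sketched in a few lines. The function name and the example differences are hypothetical, and the sketch assumes, as in the balanced design, that the two order strata contain equal numbers of radiologists.

```python
import math
from statistics import mean, variance


def change_in_sensitivity_test(diffs_12, diffs_21):
    """Stratified test for a change in overall sensitivity.

    diffs_12 and diffs_21 hold the per-radiologist (post - pre)
    sensitivities in the strata (order = 1,2) and (order = 2,1).
    Returns the estimated change, its variance V, the ratio
    change / sqrt(V), and a two-sided normal p-value.
    """
    all_diffs = list(diffs_12) + list(diffs_21)
    r_t = len(all_diffs)
    delta = mean(all_diffs)
    # Variance of a stratified sample mean with two equal-sized strata.
    v = 0.5 * (variance(diffs_12) + variance(diffs_21)) / r_t
    z = delta / math.sqrt(v)
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return delta, v, z, p_value


if __name__ == "__main__":
    # Hypothetical differences for eight radiologists, four per stratum.
    d12 = [0.10, 0.20, 0.00, 0.10]
    d21 = [0.15, 0.05, 0.10, 0.10]
    print(change_in_sensitivity_test(d12, d21))
```

The same function applies unchanged to per-radiologist differences in specificity.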


2.  Chi-Square  Test  Statistics  for  Bivariate  Analyses 

To simultaneously test the null hypotheses that both the sensitivity
and specificity are unchanged in the intervention group,
$\Delta_T(\text{sensitivity}) = 0 = \Delta_T(\text{specificity})$, the following test statistic
can be used

$$\left[\Delta_T(\text{sensitivity}) \;\; \Delta_T(\text{specificity})\right] \, \Sigma_T^{-1} \begin{bmatrix} \Delta_T(\text{sensitivity}) \\ \Delta_T(\text{specificity}) \end{bmatrix}$$

where the square bracket notation is used to denote vectors and
$\Sigma_T^{-1}$ is the inverse of a square matrix $\Sigma_T$. This matrix $\Sigma_T$ is a
variance-covariance matrix for the two-dimensional statistic
$[\Delta_T(\text{sensitivity}) \;\; \Delta_T(\text{specificity})]$, and is the analogue of the variance V
defined above in relation to the one-dimensional quantity
$\Delta_T(\text{sensitivity})$. Formally we write


$$\Sigma_T = \cdots \,/\,(R_T - 1)$$


where $\Sigma^{(1,2)}$ is the sample variance-covariance matrix for the
quantities $(S_{r,\text{post}} - S_{r,\text{pre}},\; F_{r,\text{post}} - F_{r,\text{pre}})$ in the stratum (order = 1,2),
and $\Sigma^{(2,1)}$ is the analogous quantity calculated for the other
stratum. The test statistic is compared with a standard chi-square
distribution with 2 degrees of freedom in order to test the null
hypothesis concerning changes in sensitivities and specificities.
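
A minimal sketch of the bivariate test follows. It takes the estimated changes and a 2 x 2 variance-covariance matrix as given inputs and makes no assumption about how that matrix was computed; the function name and example numbers are hypothetical. It uses the fact that the upper tail probability of a chi-square distribution with 2 degrees of freedom at x is exp(-x/2).

```python
import math


def bivariate_chi_square(d_sens, d_spec, sigma):
    """Quadratic-form test that both changes are zero.

    d_sens, d_spec: estimated changes in sensitivity and specificity.
    sigma: 2 x 2 variance-covariance matrix of (d_sens, d_spec),
           given as ((a, b), (c, d)).
    Returns the test statistic and its chi-square (2 df) p-value.
    """
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if det <= 0:
        raise ValueError("sigma must be positive definite")
    # x' sigma^{-1} x with sigma^{-1} = [[d, -b], [-c, a]] / det.
    stat = (d * d_sens ** 2 - (b + c) * d_sens * d_spec + a * d_spec ** 2) / det
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


if __name__ == "__main__":
    # Hypothetical inputs: changes of 0.05 in each parameter, standard
    # error 0.02 for each, and no covariance between them.
    sigma = ((0.0004, 0.0), (0.0, 0.0004))
    print(bivariate_chi_square(0.05, 0.05, sigma))
```

At the 0.05 level the null hypothesis is rejected when the statistic exceeds -2 ln(0.05), approximately 5.99.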

Consider now the component of the data analysis concerning
the comparison of changes between intervention and control
groups. Using a subscript C to denote the control group in analogy
with our use of the subscript T to denote the intervention group,
we define the statistics $\Delta_C(\text{sensitivity})$, $\Delta_C(\text{specificity})$ and $\Sigma_C$.
The estimated differences between the groups in changes of
sensitivities and specificities can be written as $\Delta_T(\text{sensitivity}) -
\Delta_C(\text{sensitivity})$ and $\Delta_T(\text{specificity}) - \Delta_C(\text{specificity})$, respectively.
The hypothesis that the changes are the same for intervention and
control groups can be tested by comparing the statistic

$$\left[\Delta_T(\text{sens}) - \Delta_C(\text{sens}) \;\;\; \Delta_T(\text{spec}) - \Delta_C(\text{spec})\right] \left(\Sigma_T + \Sigma_C\right)^{-1} \begin{bmatrix} \Delta_T(\text{sens}) - \Delta_C(\text{sens}) \\ \Delta_T(\text{spec}) - \Delta_C(\text{spec}) \end{bmatrix}$$

with the quantiles of a chi-square distribution with 2 degrees of
freedom, where we use the abbreviations "sens" and "spec" to
denote "sensitivity" and "specificity" in the above expressions.

3.  Mixed  Models  for  Reading  Accuracies 

Section 5 outlines a statistical model for sensitivity and specificity
parameters which vary with reader and image. Here we present a
more formal and precise definition of this model. For radiologist
r on diseased film i, we write the chance of correctly identifying
it as diseased pre-intervention using a logistic model as

$$P^{D}_{ri} = \exp\{\mu^{D} + \gamma^{D}_{i} + \beta^{D}_{r}\} / (1 + \exp\{\mu^{D} + \gamma^{D}_{i} + \beta^{D}_{r}\})$$

where $\gamma^{D}_{i}$ and $\beta^{D}_{r}$ are random variables specific to this film and
radiologist, respectively. For the average radiologist $\beta^{D}_{r} = 0$, and
for the average film $\gamma^{D}_{i} = 0$. Thus for the average radiologist on
the average film the sensitivity is $S^{D} = \exp\{\mu^{D}\}/(1 + \exp\{\mu^{D}\})$.
The films vary in difficulty in the sense that the average radiologist
has a lower sensitivity on some films and a higher sensitivity on
others. Mathematically this translates into allowing $\gamma^{D}_{i}$ to vary.
We choose it as a random variable so that the average radiologist's
sensitivity to different films varies uniformly in an interval
$(S^{D} - a^{D},\, S^{D} + a^{D})$. Technically this is achieved by letting
$\gamma^{D}_{i} = \ln\{U^{D}_{i}/(1 - U^{D}_{i})\} - \mu^{D}$, where $U^{D}_{i}$ is a random variable with a
uniform distribution in $(S^{D} - a^{D},\, S^{D} + a^{D})$. The radiologists also vary
among themselves in their sensitivities to the same film and this
inter-rater variation translates into allowing $\beta^{D}_{r}$ to vary. We
simulated data so that on the average diseased film (i.e., $\gamma^{D}_{i} = 0$) the
sensitivities of radiologists varied uniformly in $(S^{D} - b^{D},\, S^{D} + b^{D})$.
Again, technically we let $\beta^{D}_{r} = \ln\{U^{D}_{r}/(1 - U^{D}_{r})\} - \mu^{D}$, where
$U^{D}_{r}$ is a random variable with a uniform distribution on the interval
$(S^{D} - b^{D},\, S^{D} + b^{D})$.

Turning now to specificities, we write the specificity for
radiologist r on non-diseased film j pre-intervention as

$$P^{\bar D}_{rj} = \exp\{\mu^{\bar D} + \gamma^{\bar D}_{j} + \beta^{\bar D}_{r}\} / (1 + \exp\{\mu^{\bar D} + \gamma^{\bar D}_{j} + \beta^{\bar D}_{r}\})$$

where in analogy with the above notation for diseased films, the
average radiologist on the average film has specificity $F^{\bar D} =
\exp\{\mu^{\bar D}\}/(1 + \exp\{\mu^{\bar D}\})$ and parameters $a^{\bar D}$ and $b^{\bar D}$ indicate
variation in the specificity with film and radiologist. As argued in
Section 5, data should be generated so that the $\beta^{D}_{r}$ and $\beta^{\bar D}_{r}$ are
negatively correlated. We incorporated this into the simulation by first
generating the sensitivity radiologist-specific random effect
parameter, $\beta^{D}_{r}$ (i.e., his/her sensitivity to the average film), which is
based on the random variable $U^{D}_{r}$, and then letting the corresponding
random variable for the specificity random effect be defined
as

$$U^{\bar D}_{r} = F^{\bar D} - \left(U^{D}_{r} - S^{D}\right) \frac{b^{\bar D}}{b^{D}}.$$

Thus if the radiologist's sensitivity is $x \times b^{D}$ above the average
radiologist's sensitivity to the average film, $S^{D}$, his/her specificity
will be $x \times b^{\bar D}$ below the average specificity to the average film.

Our model postulates that after intervention the quantities $F^{\bar D}$
and $S^{D}$ are changed to new values but that the radiologist and
image-specific parameters remain unchanged. Thus, suppose that
after intervention the sensitivity of the average radiologist to the
average film is $\exp\{\mu^{D} + \alpha^{D}\}/(1 + \exp\{\mu^{D} + \alpha^{D}\})$. Then the
chances that radiologist r will correctly classify film i pre- and
post-intervention are

$$P^{D}_{ri,\text{pre}} = \exp\{\mu^{D} + \gamma^{D}_{i} + \beta^{D}_{r}\} / (1 + \exp\{\mu^{D} + \gamma^{D}_{i} + \beta^{D}_{r}\})$$

and

$$P^{D}_{ri,\text{post}} = \exp\{\mu^{D} + \alpha^{D} + \gamma^{D}_{i} + \beta^{D}_{r}\} / (1 + \exp\{\mu^{D} + \alpha^{D} + \gamma^{D}_{i} + \beta^{D}_{r}\}),$$

respectively. Similarly the postulated change in $F^{\bar D}$ specifies a
parameter $\alpha^{\bar D}$ (analogous to $\alpha^{D}$) which facilitates calculation of
post-intervention specificities. Having chosen values for the various
parameters $(\mu^{D}, \alpha^{D}, a^{D}, b^{D})$ and $(\mu^{\bar D}, \alpha^{\bar D}, a^{\bar D}, b^{\bar D})$, this completes
the first step of the simulation power calculation method, namely
specification of accuracy parameter distributions pre-intervention
and intervention effects.
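
To make the data-generating steps concrete, the following is a minimal sketch of a simulation built on the model above. It is restricted to sensitivity and to the within intervention group comparison, and it rests on several assumptions of ours that are not specified here: the number of diseased films per set (`n_diseased`), alternating readers between the two reading orders, and the use of the one-dimensional normal test of Appendix A.1 rather than the bivariate chi-square test. All function names are hypothetical.

```python
import math
import random
from statistics import mean, variance


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def draw_effects(rng, centre, half_width, n):
    """Random effects ln{U/(1-U)} - mu with U uniform on centre +/- half_width."""
    eps = 1e-6  # keep U strictly inside (0, 1)
    lo = max(centre - half_width, eps)
    hi = min(centre + half_width, 1.0 - eps)
    return [logit(rng.uniform(lo, hi)) - logit(centre) for _ in range(n)]


def simulate_study(rng, readers, n_diseased, s_pre, s_post, a=0.20, b=0.20):
    """Per-reader (post - pre) observed sensitivities, by reading-order stratum."""
    mu = logit(s_pre)
    alpha = logit(s_post) - mu                      # intervention effect
    gamma = {1: draw_effects(rng, s_pre, a, n_diseased),   # film set 1
             2: draw_effects(rng, s_pre, a, n_diseased)}   # film set 2
    beta = draw_effects(rng, s_pre, b, readers)            # reader effects
    strata = {(1, 2): [], (2, 1): []}
    for r in range(readers):
        order = (1, 2) if r % 2 == 0 else (2, 1)    # assumed allocation rule
        pre_hits = sum(rng.random() < expit(mu + g + beta[r])
                       for g in gamma[order[0]])
        post_hits = sum(rng.random() < expit(mu + alpha + g + beta[r])
                        for g in gamma[order[1]])
        strata[order].append((post_hits - pre_hits) / n_diseased)
    return strata[(1, 2)], strata[(2, 1)]


def rejects(d12, d21, critical=1.959964):
    """Two-sided normal test at the 0.05 level using the stratified variance."""
    r_t = len(d12) + len(d21)
    delta = mean(d12 + d21)
    v = 0.5 * (variance(d12) + variance(d21)) / r_t
    return v > 0 and abs(delta) / math.sqrt(v) > critical


def estimated_power(readers, n_diseased, s_pre, s_post, n_sim=500, seed=0):
    """Proportion of simulated studies in which the null hypothesis is rejected."""
    rng = random.Random(seed)
    hits = sum(rejects(*simulate_study(rng, readers, n_diseased, s_pre, s_post))
               for _ in range(n_sim))
    return hits / n_sim


if __name__ == "__main__":
    print("rejection rate, no effect:", estimated_power(30, 20, 0.70, 0.70, n_sim=200))
    print("rejection rate, 0.70 to 0.80:", estimated_power(30, 20, 0.70, 0.80, n_sim=200))
```

Because of these simplifications the sketch is not expected to reproduce the entries of Table 1; it is meant only to show how the random effects, the intervention shift and the repeated testing fit together. Setting `s_post` equal to `s_pre` gives a check of the rejection rate under the null hypothesis.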
Mammography Data Collection Form
Providence Centralia Hospital, Centralia, WA

PATIENT INFORMATION                    Patient ID or file number: ______

Social Security Number: ___-___-___    Telephone Number ______

First Name ______  Last Name ______  Middle Initial ___  Date of Birth __/__/__

Street Address ______  City ______  State ___  Zip ______


ETHNIC BACKGROUND
□1 Caucasian/White
□2 African American/Black
□3 Native American/Eskimo/Aleut
□4 Asian/Pacific Islander
□ Other

HISPANIC/LATINA ORIGIN
□0 No
□1 Yes

EDUCATION (check only one)
□1 1-11 Years
□2 High school graduate
□3 Some college/technical school
□4 College graduate (4 years)
□5 Post graduate degree

HEALTH INSURANCE (check all that apply)
□1 None
□1 Medicare
□1 Medicaid
□1 HMO, Managed Care
□1 Private Insurance Company
□1 Other
□1 Not Sure


1. Have you ever had breast cancer?

2. Has your mother had breast cancer?
□0 No
□1 Yes
□9 Not sure

3. How many of your sisters had breast cancer?
□ I have no sisters
□0 None of my sisters
□1 One sister
□2 Two or more sisters
□9 Not sure

If yes, were any of your sisters under age 50 when diagnosed?
□0 No
□1 Yes, one sister only
□2 Yes, two or more sisters

4. How many of your daughters had breast cancer?
□ I have no daughters
□2 Two or more daughters
□9 Not sure

If yes, were any of your daughters under age 50 when diagnosed?
□0 No
□1 Yes, one daughter only
□2 Yes, two or more daughters
□9 Not sure

5. Has any relative had ovarian cancer?
□0 No
□1 Mother, sister or daughter
□2 Aunt or grandmother
□3 Other relative
□9 Not sure


6. Previous breast procedures (check all that apply)

                            Left   Right   Both
Fine Needle Aspiration      □1     □2      □3
Core Needle Biopsy          □1     □2      □3
Open Excisional Biopsy      □1     □2      □3
Lumpectomy                  □1     □2      □3
Mastectomy                  □1     □2      □3
Radiation Therapy           □1     □2      □3
Reconstruction              □1     □2      □3
Augmentation/Implants       □1     □2      □3

7. Date of most recent breast biopsy:
__/__/__    □1 Never had a biopsy

8. Your age at the birth of your first child:
____    □00 I have no natural children

9. Have your menstrual periods stopped permanently? (check only one)
□0 No
□1 No, but my periods are less frequent
□2 I now have bleeding from hormone replacement
□3 Yes, my periods stopped naturally (menopause)
□4 Yes, my periods stopped due to surgery
□9 Not sure

If no, what is the approximate length in days of
your menstrual cycle? ____

And, what was the date of the start of your last
menstrual cycle (please estimate if you don't know
the exact day)? __/__/__

10. Have you had one or both ovaries removed?
□0 No
□1 Yes, one ovary removed
□2 Yes, two ovaries removed
□9 Not sure

DO  NOT  WRITE  BELOW  THIS  LINE 


11. Are you currently using any hormones? (check all that apply)
□1 No
□1 Yes, Estrogen only
□1 Yes, Estrogen and Progesterone
□1 Yes, Tamoxifen
□1 Yes, birth control
□1 Yes, other hormone
□1 Not sure

12. Have you had any problems or symptoms
with your breasts in the last 3 months?
□0 No

                    Left   Right   Both
Lump                □1     □2      □3
Nipple discharge    □1     □2      □3
Pain                □1     □2      □3
Skin changes        □1     □2      □3
Other               □1     □2      □3

13. Did you make this appointment due to a
concern about a breast problem found in the
past 3 months? (check one)
□0 No, this is a routine mammogram.
□1 Yes, I found something new.
□2 Yes, my doctor found something new.
□3 Yes, I have a general concern but no
specific symptoms.

14. Have you had a previous mammogram?
□0 No  □1 Yes  □9 Not sure

If yes, what was the date of your last
mammogram? __/__/__

15. Have you ever had a clinical breast exam
(a physical breast exam performed by a health
care provider)?
□0 No  □1 Yes  □9 Not sure

If yes, how long since your last clinical
breast exam?

□1  Within  thetesf:8;po!nths 


/ 


30 


DO  NOT  WRITE  BELOW  THIS  LINE 


Facility/Exposure Site ______

1. Physical exam results
□0 Negative
□1 Positive (suspicious for malignancy)
□2 Not performed

2. Symptoms (check all that apply)
□1 None
□1 Lump
□1 Bloody nipple discharge
□1 Pain
□1 Other ______

3. Was patient referred because of symptoms detected
by CBE performed within the last 3 months?
□0 No  □1 Yes  □9 Unknown

4. Date of last mammogram __/__/__

5. Comparison films available?  □0 No  □1 Yes

6. Reason for mammogram (check only one)
□0 Screening (asymptomatic)
□1 Diagnostic (symptomatic)
□2 Short interval follow-up
□3 Additional view(s) for current exam
□4 Special study
□ Other ______

7. Procedure
□3 Bilateral mammography
□2 Right only
□1 Left only


Radiologist ID ______

8. Density (code breast with greatest density)
□1 Mostly fatty
□2 Scattered fibroglandular tissue
□3 Heterogeneously dense
□4 Extremely dense

9. Assessment - Right Breast
□0 Needs additional evaluation
□1 Normal
□2 Benign finding
□3 Probably benign; short follow-up
□4 Suspicious abnormality
□5 Highly suspicious for malignancy

10. Assessment - Left Breast
□0 Needs additional evaluation
□1 Normal
□2 Benign finding
□3 Probably benign; short follow-up
□4 Suspicious abnormality
□5 Highly suspicious for malignancy

11. Assessment based on: (check all that apply)
□1 Basic 2 views per breast
□1 Additional views
    □1 Left breast  □2 Right breast  □3 Both breasts
□1 Clinical findings
□1 Referring physician's report
□1 Comparison with previous films
□1 Patient report
□1 Ultrasound
□1 Family history
□1 Patient history


Date of Mammogram __/__/__

12. Recommendation for mammogram follow-up
□1 Routine follow up interval  Months: ____
□1 Short term follow up  Months: ____

13. Recommendation for immediate work-up (check all that apply)
□1 Additional views
□1 Ultrasound
□1 FNA
□1 Core needle biopsy
□1 Surgical biopsy
□1 MRI
□1 Surgical or clinical consult
□1 Other immediate workup: ______

[Breast diagrams]

Signature ______

Diagram key: O = lump, « = scar, # = mole, X = pain
October 15, 1998


Commander 

U.S. Army Medical Research and Materiel Command
ATTN:  MCMR-RMI-S 
504  Scott  Street 

Fort  Detrick,  Maryland  21702-5012 


Dear  Commander, 

Enclosed you will find the original and three copies of the annual report for project grant # DAMD17-96-1-6288,
"Reaching Rural Mammographers for Quality Improvement." Also enclosed is a floppy diskette with
the text of the report saved in ASCII format.

If you have any questions, please contact Dr. Nicole Urban, Principal Investigator, by telephone at 206-667-5121
or by email at nurban@fhcrc.org.


Sincerely, 


Project  Manager 


Enc. 


1100 Fairview Ave. N., MP 702, Seattle, Washington 98109-1024  (206) 667-4678


